Abstract

We introduce a proximal version of the dual coordinate ascent method. We demonstrate how the derived
algorithmic framework can be used for numerous regularized loss minimization problems, including
$\ell_1$ regularization and structured output SVM. The convergence rates we obtain match, and sometimes
improve, state-of-the-art results.



1 Introduction 



We consider the following generic optimization problem associated with regularized loss minimization of
linear predictors: Let $X_1, \ldots, X_n$ be matrices in $\mathbb{R}^{d \times k}$, let $\phi_1, \ldots, \phi_n$ be a sequence of vector convex
functions defined on $\mathbb{R}^k$, and let $g(\cdot)$ be a convex function defined on $\mathbb{R}^d$. Our goal is to solve $\min_{w \in \mathbb{R}^d} P(w)$
where

$$P(w) = \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w) \right] \tag{1}$$

and $\lambda > 0$ is a regularization parameter. We will later show how to use a solver for (1) for several popular
regularized loss minimization problems including $\ell_1$ regularization and structured output SVM.

Let $w^*$ be the optimum of (1). We say that a solution $w$ is $\epsilon_P$-sub-optimal if $P(w) - P(w^*) \le \epsilon_P$. We
analyze the runtime of optimization procedures as a function of the time required to find an $\epsilon_P$-sub-optimal
solution.

The dual coordinate ascent (DCA) method solves a dual problem of (1). Specifically, for each $i$ let
$\phi_i^* : \mathbb{R}^k \to \mathbb{R}$ be the convex conjugate of $\phi_i$, namely, $\phi_i^*(u) = \max_{z \in \mathbb{R}^k} (z^\top u - \phi_i(z))$. Similarly we define
the convex conjugate $g^*$ of $g$. The dual problem is

$$\max_{\alpha \in \mathbb{R}^{k \times n}} D(\alpha) \quad \text{where} \quad D(\alpha) = \left[ \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-\alpha_i) - \lambda g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^n X_i \alpha_i \right) \right] \tag{2}$$

where $\alpha_i$ is the $i$'th column of the matrix $\alpha$, which forms a vector in $\mathbb{R}^k$. The dual objective in (2) has a
different dual vector associated with each example in the training set. At each iteration of DCA, the dual
objective is optimized with respect to a single dual vector, while the rest of the dual vectors are kept intact.
We assume that $g^*(\cdot)$ is continuously differentiable. If we define

$$w(\alpha) = \nabla g^*(v(\alpha)), \qquad v(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^n X_i \alpha_i , \tag{3}$$

then it is known that $w(\alpha^*) = w^*$, where $\alpha^*$ is an optimal solution of (2). It is also known that $P(w^*) =
D(\alpha^*)$, which immediately implies that for all $w$ and $\alpha$, we have $P(w) \ge D(\alpha)$, and hence the duality gap
defined as

$$P(w(\alpha)) - D(\alpha)$$

can be regarded as an upper bound on the primal sub-optimality $P(w(\alpha)) - P(w^*)$.
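
As a concrete illustration of the duality gap, the following minimal Python sketch evaluates $P(w(\alpha))$ and $D(\alpha)$ on a toy instance. It assumes $k = 1$, the squared loss $\phi_i(z) = \frac{1}{2}(z - y_i)^2$ (for which $-\phi_i^*(-a) = a y_i - \frac{1}{2}a^2$) and $g(w) = \frac{1}{2}\|w\|_2^2$ (so that $\nabla g^*$ is the identity). These choices, the function names and the random data are assumptions made only for illustration.

```python
import numpy as np

def primal(X, y, lam, w):
    # P(w) = (1/n) sum_i 0.5*(x_i^T w - y_i)^2 + lam * 0.5*||w||^2
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

def w_of_alpha(X, lam, alpha):
    # w(alpha) = grad g^*(v(alpha)) = v(alpha) when g(w) = 0.5*||w||^2
    n = X.shape[0]
    return X.T @ alpha / (lam * n)

def dual(X, y, lam, alpha):
    # D(alpha) = (1/n) sum_i -phi_i^*(-alpha_i) - lam * g^*(v(alpha)),
    # with -phi_i^*(-a) = a*y_i - 0.5*a^2 and g^*(v) = 0.5*||v||^2.
    v = w_of_alpha(X, lam, alpha)
    return np.mean(alpha * y - 0.5 * alpha ** 2) - 0.5 * lam * (v @ v)

def duality_gap(X, y, lam, alpha):
    return primal(X, y, lam, w_of_alpha(X, lam, alpha)) - dual(X, y, lam, alpha)

# Toy data (hypothetical): rows of X are the instances x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.uniform(-1.0, 1.0, size=20)
gaps = [duality_gap(X, y, 0.1, rng.normal(size=20)) for _ in range(100)]
```

By weak duality every entry of `gaps` should be non-negative (up to rounding), whatever dual vector is plugged in.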

We focus on a stochastic version of DCA, abbreviated by SDCA, in which at each round we choose
which dual vector to optimize uniformly at random. We analyze SDCA either for $L$-Lipschitz loss functions
or for $(1/\gamma)$-smooth loss functions, which are defined as follows.

Definition 1. A function $\phi_i : \mathbb{R}^k \to \mathbb{R}$ is $L$-Lipschitz if for all $a, b \in \mathbb{R}^k$ we have

$$|\phi_i(a) - \phi_i(b)| \le L \|a - b\|_P ,$$

where $\|\cdot\|_P$ is a norm.

A function $\phi_i : \mathbb{R}^k \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient is $(1/\gamma)$-Lipschitz. An
equivalent condition is that for all $a, b \in \mathbb{R}^k$, we have

$$\phi_i(a) \le \phi_i(b) + \nabla\phi_i(b)^\top (a - b) + \frac{1}{2\gamma} \|a - b\|_P^2 .$$

It is well-known that if $\phi_i(a)$ is $(1/\gamma)$-smooth, then $\phi_i^*(u)$ is $\gamma$ strongly convex w.r.t. the dual norm: for
all $u, v \in \mathbb{R}^k$ and $s \in [0, 1]$:

$$-\phi_i^*(su + (1 - s)v) \ge -s\phi_i^*(u) - (1 - s)\phi_i^*(v) + \frac{\gamma s(1 - s)}{2} \|u - v\|_D^2 ,$$

where $\|\cdot\|_D$ is the dual norm of $\|\cdot\|_P$ defined as

$$\|u\|_D = \sup_{\|v\|_P = 1} u^\top v .$$

We also assume that $g(w)$ is 1-strongly convex with respect to another norm $\|\cdot\|_{P'}$:

$$g(w + \Delta w) \ge g(w) + \nabla g(w)^\top \Delta w + \frac{1}{2} \|\Delta w\|_{P'}^2 ,$$

which means that $g^*(v)$ is 1-smooth with respect to its dual norm $\|\cdot\|_{D'}$. Namely,

$$g^*(v + \Delta v) \le h(v; \Delta v) , \tag{4}$$

where

$$h(v; \Delta v) := g^*(v) + \nabla g^*(v)^\top \Delta v + \frac{1}{2} \|\Delta v\|_{D'}^2 . \tag{5}$$



2 Main Results 

The generic Prox-SDCA algorithm which we analyze in this paper is presented in Figure 1. The ideas are
described as follows. Consider the maximal increase of the dual objective, where we only allow to change
the $i$'th column of $\alpha$. At step $t$, let $v^{(t-1)} = (\lambda n)^{-1} \sum_i X_i \alpha_i^{(t-1)}$ and let $w^{(t-1)} = \nabla g^*(v^{(t-1)})$. We will
update the $i$-th dual variable $\alpha_i^{(t)} = \alpha_i^{(t-1)} + \Delta\alpha_i$, in a way that will lead to a sufficient increase of the dual
objective. For the primal variable, this would lead to the update $v^{(t)} = v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i$, and therefore
$w^{(t)} = \nabla g^*(v^{(t)})$ can also be written as

$$w^{(t)} = \operatorname*{argmax}_w \left[ w^\top v^{(t)} - g(w) \right] = \operatorname*{argmin}_w \left[ -w^\top \left( n^{-1} \sum_{i=1}^n X_i \alpha_i^{(t)} \right) + \lambda g(w) \right] .$$

Note that this particular update is rather similar to the update step of the proximal-gradient dual-averaging
method in the SGD domain, Xiao [2010]. The difference is in how $\alpha^{(t)}$ is updated, and as we will show
later, stronger results can be proved for the Prox-SDCA method when we run SDCA for $t > n$ iterations
with smooth loss functions.

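In order to motivate the proximal SDCA algorithm, we note that the goal of SDCA is to increase the
dual objective as much as possible, and thus the optimal way to choose $\Delta\alpha_i$ would be to maximize the dual
objective, namely, we shall let

$$\Delta\alpha_i = \operatorname*{argmax}_{\Delta\alpha_i} \left[ -\frac{1}{n} \phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - \lambda g^*\!\left( v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i \right) \right] .$$

However, for complex $g^*(\cdot)$, this optimization problem may not be easy to solve. We will simplify this
optimization problem by relying on (4). That is, instead of directly maximizing the dual objective function,
we try to maximize the following proximal objective which is a lower bound of the dual objective:

$$\operatorname*{argmax}_{\Delta\alpha_i} \left[ -\frac{1}{n} \phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - \lambda \left( \nabla g^*(v^{(t-1)})^\top (\lambda n)^{-1} X_i \Delta\alpha_i + \frac{1}{2} \|(\lambda n)^{-1} X_i \Delta\alpha_i\|_{D'}^2 \right) \right]$$

$$= \operatorname*{argmax}_{\Delta\alpha_i} \left[ -\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - w^{(t-1)\top} X_i \Delta\alpha_i - \frac{1}{2\lambda n} \|X_i \Delta\alpha_i\|_{D'}^2 \right] .$$

However, in general, this optimization problem is not necessarily simple to solve. We will thus also propose
alternative update rules for $\Delta\alpha_i$ of the form $\Delta\alpha_i = s(u - \alpha_i^{(t-1)})$ for an appropriately chosen step size
parameter $s > 0$ and any vector $u \in \mathbb{R}^k$ such that $-u \in \partial\phi_i(X_i^\top w^{(t-1)})$. Our analysis shows that an
appropriate choice of $s$ still leads to a sufficient increase in the dual objective.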

We analyze the algorithm based on different assumptions on the loss functions. To simplify the statements
of our theorems, we always make the following assumptions:

• Assume that the loss functions satisfy

$$\frac{1}{n} \sum_{i=1}^n \phi_i(0) \le 1 \quad \text{and} \quad \forall i, a, \ \phi_i(a) \ge 0 .$$

• Assume that $\max_i \|X_i\| \le R$, where

$$\|X_i\| = \sup_{u \ne 0} \frac{\|X_i u\|_{D'}}{\|u\|_D} .$$

Under the above assumptions, we have the following convergence result for smooth loss functions.
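Procedure Prox-SDCA

Parameters: scalars $\lambda$, $\gamma$ ($\gamma$ can be 0), $R$; norms $\|\cdot\|_D$, $\|\cdot\|_{D'}$
Let $\alpha^{(0)} = 0$, $w^{(0)} = \nabla g^*(0)$
Iterate: for $t = 1, 2, \ldots, T$:
    Randomly pick $i$
    Find $\Delta\alpha_i$ using any of the following options (or achieving larger dual objective than one of the options):
    Option I:
        $\Delta\alpha_i \in \operatorname*{argmax}_{\Delta\alpha_i} \left[ -\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - w^{(t-1)\top} X_i \Delta\alpha_i - \frac{1}{2\lambda n} \|X_i \Delta\alpha_i\|_{D'}^2 \right]$
    Option II:
        Let $u$ be s.t. $-u \in \partial\phi_i(X_i^\top w^{(t-1)})$
        Let $z = u - \alpha_i^{(t-1)}$
        Let $s = \operatorname*{argmax}_{s \in [0,1]} \left[ -\phi_i^*(-(\alpha_i^{(t-1)} + sz)) - s\, w^{(t-1)\top} X_i z - \frac{s^2}{2\lambda n} \|X_i z\|_{D'}^2 \right]$
        Set $\Delta\alpha_i = sz$
    Option III:
        Same as Option II but replace the definition of $s$ as follows:
        Let $s = \min\left(1, \ \frac{\phi_i(X_i^\top w^{(t-1)}) + \phi_i^*(-\alpha_i^{(t-1)}) + w^{(t-1)\top} X_i \alpha_i^{(t-1)} + \frac{\gamma}{2} \|z\|_D^2}{\|z\|_D^2 \left( \gamma + \|X_i\|^2 / (\lambda n) \right)} \right)$
    Option IV:
        Same as Option III but replace $\|X_i\|^2$ in the definition of $s$ with $R^2$
        May also replace $\|z\|_D^2$ with an upper bound no larger than $4L^2$ for $L$-Lipschitz non-smooth loss
    Option V (only for smooth losses):
        Set $\Delta\alpha_i = \frac{\lambda n \gamma}{R^2 + \lambda n \gamma} \left( -\nabla\phi_i(X_i^\top w^{(t-1)}) - \alpha_i^{(t-1)} \right)$
    $\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i e_i$
    $v^{(t)} \leftarrow v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i$
    $w^{(t)} \leftarrow \nabla g^*(v^{(t)})$
Output (Averaging option):
    Let $\bar\alpha = \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \alpha^{(t-1)}$
    Let $\bar w = w(\bar\alpha) = \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} w^{(t-1)}$
    return $\bar w$
Output (Random option):
    Let $\bar\alpha = \alpha^{(t)}$ and $\bar w = w^{(t)}$ for some random $t \in T_0 + 1, \ldots, T$
    return $\bar w$

Figure 1: The Generic Proximal Stochastic Dual Coordinate Ascent Algorithm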
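
To make the procedure concrete, here is a minimal Python sketch of Prox-SDCA with Option V. It is not the general algorithm: it assumes $k = 1$, the squared loss $\phi_i(z) = \frac{1}{2}(z - y_i)^2$ (which is 1-smooth, so $\gamma = 1$) and $g(w) = \frac{1}{2}\|w\|_2^2$ (so $\nabla g^*(v) = v$). The function names and the toy data are hypothetical.

```python
import numpy as np

def prox_sdca_option_v(X, y, lam, T, seed=0):
    """Prox-SDCA, Option V, for squared loss and g(w) = 0.5*||w||^2 (assumed setting)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gamma = 1.0                                   # squared loss is (1/gamma)-smooth with gamma = 1
    R2 = np.max(np.sum(X ** 2, axis=1))           # R^2 >= max_i ||x_i||_2^2
    s = lam * n * gamma / (R2 + lam * n * gamma)  # fixed step size of Option V
    alpha = np.zeros(n)                           # alpha^(0) = 0
    v = np.zeros(d)
    w = v.copy()                                  # w^(0) = grad g^*(0) = 0
    for _ in range(T):
        i = rng.integers(n)                       # pick i uniformly at random
        grad = X[i] @ w - y[i]                    # phi_i'(x_i^T w)
        delta = s * (-grad - alpha[i])            # Delta alpha_i
        alpha[i] += delta
        v += X[i] * delta / (lam * n)             # v <- v + (lam n)^{-1} x_i Delta alpha_i
        w = v.copy()                              # w <- grad g^*(v) = v
    return w, alpha

def gap(X, y, lam, w, alpha):
    """Duality gap P(w) - D(alpha) for the same assumed loss and regularizer."""
    n = X.shape[0]
    v = X.T @ alpha / (lam * n)
    p = 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
    dval = np.mean(alpha * y - 0.5 * alpha ** 2) - 0.5 * lam * (v @ v)
    return p - dval

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm rows, so R = 1
y = rng.uniform(-1.0, 1.0, size=50)
w, alpha = prox_sdca_option_v(X, y, lam=0.1, T=5000)
final_gap = gap(X, y, 0.1, w, alpha)
```

Because the duality gap is available in closed form here, it can serve directly as a stopping criterion instead of a fixed iteration count `T`.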
Theorem 1. Consider Procedure Prox-SDCA. Assume that $\phi_i$ is $(1/\gamma)$-smooth for all $i$. To obtain an
expected duality gap of $\mathbb{E}[P(w^{(T)}) - D(\alpha^{(T)})] \le \epsilon_P$, it suffices to have a total number of iterations of

$$T \ge \left( n + \frac{R^2}{\lambda\gamma} \right) \log\left( \left( n + \frac{R^2}{\lambda\gamma} \right) \cdot \frac{1}{\epsilon_P} \right) .$$

Moreover, to obtain an expected duality gap of $\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \epsilon_P$, it suffices to have a total number of
iterations of $T > T_0$ where

$$T_0 \ge \left( n + \frac{R^2}{\lambda\gamma} \right) \log\left( \left( n + \frac{R^2}{\lambda\gamma} \right) \cdot \frac{1}{(T - T_0)\epsilon_P} \right) .$$

The linear convergence result in the above theorem is faster than the corresponding proximal SGD result
when $T \gg n$. This indicates the advantage of the Proximal SDCA approach when we run more than one pass
over the data. Similar results can also be found in Collins et al. [2008], Le Roux et al. [2012], Shalev-Shwartz
and Zhang [2012] but in more restricted settings than the general problem considered in this paper.
Unlike traditional batch algorithms (such as proximal gradient descent, or accelerated proximal gradient
descent) that can only achieve relatively fast convergence when the condition number $1/(\lambda\gamma) = O(1)$, our
algorithm allows relatively fast convergence even when the condition number $1/(\lambda\gamma) = O(n)$, which can
be a significant improvement for real applications.
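
The first bound of Theorem 1 is easy to evaluate numerically. The short sketch below computes it; the function name and the sample values of $n$, $R$, $\lambda$, $\gamma$ and $\epsilon_P$ are hypothetical and chosen only to show the order of magnitude.

```python
import math

def iterations_smooth(n, R, lam, gamma, eps_p):
    """Smallest integer T with T >= (n + R^2/(lam*gamma)) * log((n + R^2/(lam*gamma)) / eps_p)."""
    kappa = n + R ** 2 / (lam * gamma)
    return math.ceil(kappa * math.log(kappa / eps_p))

# Example: n = 1000 examples, R = 1, lambda = 1e-3, gamma = 1, eps_P = 1e-3.
T_smooth = iterations_smooth(1000, 1.0, 1e-3, 1.0, 1e-3)
```

With these sample values $n + R^2/(\lambda\gamma) = 2000$, so the bound is roughly 29 passes over the data.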

For nonsmooth loss functions, the convergence rate for Prox-SDCA is given below. 

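Theorem 2. Consider Procedure Prox-SDCA. Assume that $\phi_i$ is $L$-Lipschitz for all $i$. To obtain an expected
duality gap of $\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \epsilon_P$, it suffices to have a total number of iterations of

$$T \ge T_0 + n + \frac{4 (RL)^2}{\lambda \epsilon_P} \ge \max\left(0, \lceil n \log(0.5 \lambda n (RL)^{-2}) \rceil \right) + n + \frac{20 (RL)^2}{\lambda \epsilon_P} .$$

Moreover, when $t \ge T_0$, we have a dual sub-optimality bound of $\mathbb{E}[D(\alpha^*) - D(\alpha^{(t)})] \le \epsilon_P / 2$.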

The result shown in the above theorem for nonsmooth loss is comparable to that of proximal SGD.
However, one advantage of our result is that the convergence is in duality gap, which can be easily checked
during the algorithm to serve as a stopping criterion. In comparison, SGD does not have an easy to implement
stopping criterion. Moreover, as discussed in Shalev-Shwartz and Zhang [2012], faster convergence
(such as linear convergence) can be obtained asymptotically when the nonsmooth loss function is nearly
everywhere smooth, and in such case, the practical performance of the algorithm will be superior to SGD
when we run more than one pass over the data.
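
For comparison with the smooth case, the right-hand side of the bound in Theorem 2 can be evaluated in the same way. This is a small sketch; the function name and the sample parameter values are hypothetical.

```python
import math

def iterations_lipschitz(n, R, L, lam, eps_p):
    """Evaluate max(0, ceil(n*log(0.5*lam*n*(R*L)^-2))) + n + 20*(R*L)^2/(lam*eps_p)."""
    t0 = max(0, math.ceil(n * math.log(0.5 * lam * n / (R * L) ** 2)))
    return t0 + n + math.ceil(20 * (R * L) ** 2 / (lam * eps_p))

# Example: n = 1000, R = L = 1, lambda = 1e-2, eps_P = 1e-2.
T_lipschitz = iterations_lipschitz(1000, 1.0, 1.0, 1e-2, 1e-2)
```

Here the $1/(\lambda \epsilon_P)$ term dominates, which reflects the sublinear rate for nonsmooth losses.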

3 Applications 

There are numerous possible applications of our algorithmic framework. Here we list three applications. 

3.1 $\ell_1$ regularization assuming instances of low $\ell_2$ norm

Suppose our interest is to solve an $\ell_1$ regularization problem of the form

$$\min_w \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(x_i^\top w) + \sigma \|w\|_1 \right] \tag{6}$$

with a positive regularization parameter $\sigma \in \mathbb{R}_+$. Assume also that $R = \max_i \|x_i\|_2$ is not too large.
This would be the case, for example, in text categorization problems where each $x_i$ is a bag-of-words
representation of some short document.

Let $w^*$ be an optimal solution of (6) and assume$^1$ that $\|w^*\|_2 \le B$. Choose $\lambda = \epsilon / B^2$ and

$$g(w) = \frac{1}{2} \|w\|_2^2 + \frac{\sigma}{\lambda} \|w\|_1 . \tag{7}$$

Consider the problem:

$$\min_w P(w) := \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(x_i^\top w) + \lambda g(w) \right] . \tag{8}$$

Then, if $w$ is an $(\epsilon/2)$-approximated solution of the above it holds that

$$\frac{1}{n} \sum_{i=1}^n \phi_i(x_i^\top w) + \sigma \|w\|_1 \le P(w) \le P(w^*) + \frac{\epsilon}{2} \le \frac{1}{n} \sum_{i=1}^n \phi_i(x_i^\top w^*) + \sigma \|w^*\|_1 + \epsilon .$$

It follows that $w$ is an $\epsilon$-approximated solution to the problem (6). Hence, we can focus on solving (8) based
on the Prox-SDCA framework. Note that if our goal is to solve a general L1-L2 regularization problem
with a fixed $\lambda$ independent of $\epsilon$, then linear convergence can be obtained from our analysis when the loss
functions are smooth. However, this section focuses on the case that our interest is to solve (6), and thus
$\lambda$ is chosen according to $\epsilon$. The reason to introduce an extra $\ell_2$ regularization in (7) is that our theory
requires $g(w)$ to be 1-strongly convex, which is satisfied by (7) with respect to the $\ell_2$-norm.

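To derive the actual algorithm, we first need to calculate the gradient of the conjugate of $g$. We have

$$\nabla g^*(v) = \operatorname*{argmax}_w \left[ w^\top v - \frac{1}{2} \|w\|_2^2 - \frac{\sigma}{\lambda} \|w\|_1 \right] = \operatorname*{argmin}_w \left[ \frac{1}{2} \|w - v\|_2^2 + \frac{\sigma}{\lambda} \|w\|_1 \right] .$$

At an optimal solution, a sub-gradient of the objective of the optimization problem above vanishes, that is,
$w - v + \frac{\sigma}{\lambda} z = 0$, where $z$ is a vector with $z_i = \mathrm{sign}(w_i)$, where if $w_i = 0$ then $z_i \in [-1, 1]$.
Therefore, if $w$ is an optimal solution then for all $i$, either $w_i = 0$ or $w_i = v_i - \frac{\sigma}{\lambda} \mathrm{sign}(w_i)$.
Furthermore, it is easy to verify that if $w$ is an optimal solution then for all $i$, if $w_i \ne 0$ then the sign of
$w_i$ must be the sign of $v_i$. Therefore, whenever $w_i \ne 0$ we have that $w_i = v_i - \frac{\sigma}{\lambda} \mathrm{sign}(v_i)$. It follows
that in that case we must have $|v_i| > \frac{\sigma}{\lambda}$. And, the other direction is also true, namely, if $|v_i| > \frac{\sigma}{\lambda}$ then
setting $w_i = v_i - \frac{\sigma}{\lambda} \mathrm{sign}(v_i)$ leads to an objective value (for the $i$'th coordinate) of

$$\frac{1}{2} \left( \frac{\sigma}{\lambda} \right)^2 + \frac{\sigma}{\lambda} \left( |v_i| - \frac{\sigma}{\lambda} \right) \le \frac{1}{2} |v_i|^2 ,$$

where the right-hand side is the objective value we will obtain by setting $w_i = 0$. This leads to the conclusion
that

$$\nabla_i g^*(v) = \mathrm{sign}(v_i) \left[ |v_i| - \frac{\sigma}{\lambda} \right]_+ = \begin{cases} v_i - \frac{\sigma}{\lambda} \mathrm{sign}(v_i) & \text{if } |v_i| > \frac{\sigma}{\lambda} \\ 0 & \text{o.w.} \end{cases}$$

The resulting algorithm is as follows: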



$^1$ We can always take $B = 1/\sigma$ since by the optimality of $w^*$ we have $\|w^*\|_2 \le \|w^*\|_1 \le 1/\sigma$.
Procedure Prox-SDCA for minimizing (6) using $g$ as in (7)

Parameters:
    regularization $\sigma$
    target accuracy $\epsilon$
    $B \ge \|w^*\|_2$ (default value $B = 1/\sigma$)
Run Prox-SDCA with:
    $\|\cdot\|_D = |\cdot|$, $\|\cdot\|_{D'} = \|\cdot\|_2$, and $R \ge \max_i \|x_i\|_2$
    $\lambda = \epsilon / B^2$
    $\nabla_i g^*(v) = \mathrm{sign}(v_i) \left[ |v_i| - \frac{\sigma}{\lambda} \right]_+$
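
The gradient of the conjugate used in this procedure is the familiar soft-thresholding (shrinkage) operator. A minimal NumPy sketch follows; the function names are hypothetical, and `objective` is the per-vector problem $\frac{1}{2}\|w - v\|_2^2 + \frac{\sigma}{\lambda}\|w\|_1$ whose minimizer defines $\nabla g^*(v)$.

```python
import numpy as np

def grad_g_conj_l1l2(v, sigma, lam):
    """Soft-thresholding: sign(v_i) * max(|v_i| - sigma/lam, 0), coordinate-wise."""
    thresh = sigma / lam
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def objective(w, v, sigma, lam):
    """0.5*||w - v||_2^2 + (sigma/lam)*||w||_1, minimized by grad_g_conj_l1l2(v, ...)."""
    return 0.5 * np.sum((w - v) ** 2) + (sigma / lam) * np.sum(np.abs(w))

v = np.array([3.0, -0.5, -2.0])
w_star = grad_g_conj_l1l2(v, sigma=1.0, lam=1.0)   # coordinates below the threshold are zeroed
```

Coordinates with $|v_i| \le \sigma/\lambda$ are set exactly to zero, which is what makes the primal iterates sparse.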





In terms of runtime, we obtain the following result from the general theory, where the notation $\tilde{O}(\cdot)$
ignores any log-factor.

Corollary 1. The number of iterations required by Prox-SDCA, with $g(\cdot)$ as in (7), for solving (6) to an
accuracy $\epsilon$ is

$$\tilde{O}\left( n + \frac{R^2 B^2}{\epsilon \gamma} \right) \quad \text{if } \forall i, \ \phi_i \text{ is } (1/\gamma)\text{-smooth}$$

$$\tilde{O}\left( n + \frac{L^2 R^2 B^2}{\epsilon^2} \right) \quad \text{if } \forall i, \ \phi_i \text{ is } L\text{-Lipschitz}$$

In both cases, $R$ is an upper bound on $\max_i \|x_i\|_2$ and $B$ is an upper bound on $\|w^*\|_2$.



Related Work 



Standard SGD requires $O(R^2 B^2 / \epsilon^2)$ iterations even in the case of smooth loss functions. Several variants of SGD
that lead to sparser intermediate solutions have been proposed (e.g. Langford et al. [2009], Shalev-Shwartz and
Tewari [2011], Xiao [2010], Duchi and Singer [2009], Duchi et al. [2010]). However, all of these variants
share the iteration bound of $O(R^2 B^2 / \epsilon^2)$, which is slower than our bound when $\epsilon$ is small.

Another relevant approach is the FISTA algorithm of Beck and Teboulle [2009]. The shrinkage operator
of FISTA is the same as the gradient of $g^*$ used in our approach. It is a batch algorithm using Nesterov's
accelerated gradient technique. For smooth loss functions, FISTA enjoys the iteration bound of

$$O\left( \frac{RB}{\sqrt{\epsilon \gamma}} \right) .$$

However, each iteration of FISTA involves all the $n$ examples rather than just a single example, as in our
method. Therefore, the runtime of FISTA would be

$$O\left( d\, n\, \frac{RB}{\sqrt{\epsilon \gamma}} \right) .$$

In contrast, the runtime of Prox-SDCA is

$$\tilde{O}\left( d \left( n + \frac{R^2 B^2}{\epsilon \gamma} \right) \right) ,$$

which is better when $n \gg \frac{RB}{\sqrt{\epsilon \gamma}}$. This happens in the statistically interesting regime where we usually
choose $\epsilon$ larger than $\Omega(1/n^2)$ for machine learning problems. In fact, since the generalization performance
of a learning algorithm is in general no better than $O(1/n)$, there is no need to choose $\epsilon = o(1/n)$. This
means that in the statistically interesting regime, Prox-SDCA is superior to FISTA.

Another approach to solving (6) when the loss functions are smooth is stochastic coordinate descent
over the primal problem. Shalev-Shwartz and Tewari [2011] showed that the runtime of this approach is

$$O\left( \frac{d\, n\, B^2}{\epsilon} \right)$$

under the assumption that $\|x_i\|_\infty \le 1$ for all $i$. Similar results can also be found in Nesterov [2012].
For our method, each iteration costs runtime $O(d)$ so the total runtime is

$$\tilde{O}\left( d \left( n + \frac{R^2 B^2}{\epsilon} \right) \right) ,$$

where $R = \max_i \|x_i\|_2$. Since the assumption $\|x_i\|_\infty \le 1$ implies $R^2 \le d$, this is similar to the guarantee of
Shalev-Shwartz and Tewari [2011] in the worst case. However, in many problems, $R^2$ can be a constant that
does not depend on $d$ (e.g. when the instances are sparse). In that case, the runtime of Prox-SDCA becomes
$\tilde{O}(d(n + B^2/\epsilon))$, which is much better than the runtime bound for the primal stochastic coordinate descent
method given in Shalev-Shwartz and Tewari [2011].



3.2 $\ell_1$ regularization with low $\ell_\infty$ instances

Next, we consider (6) but now we assume that $R = \max_i \|x_i\|_\infty$ is not too large (but $\max_i \|x_i\|_2$ might be
large). This is the situation considered in Shalev-Shwartz and Tewari [2011].

Let $w^*$ be an optimal solution of (6) and assume$^2$ that $\|w^*\|_1 \le B$. Choose $\lambda = \frac{\epsilon}{3 \log(d) B^2}$ and

$$g(w) = \frac{3 \log(d)}{2} \|w\|_q^2 + \frac{\sigma}{\lambda} \|w\|_1 , \tag{9}$$

where $q = \frac{\log(d)}{\log(d) - 1}$. The function $g(w)$ is 1-strongly convex with respect to the norm $\|\cdot\|_1$ over $\mathbb{R}^d$ (see
for example Kakade et al. [2012]). Consider the problem (8) with $g(\cdot)$ being defined in (9). As before, if
$w$ is an $(\epsilon/2)$-approximated solution of the above problem then it is also an $\epsilon$-approximated solution to the
problem (6). Hence, we can focus on solving (8) based on the Prox-SDCA framework.

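To derive the actual algorithm, we need to calculate the gradient of the conjugate of $g$. We have

$$\nabla g^*(v) = \operatorname*{argmin}_w \left[ -w^\top v + \frac{3 \log(d)}{2} \|w\|_q^2 + \frac{\sigma}{\lambda} \|w\|_1 \right] .$$

The $i$'th component of a sub-gradient of the objective of the optimization problem above is of the form

$$-v_i + \frac{3 \log(d)\, \mathrm{sign}(w_i) |w_i|^{q-1}}{\|w\|_q^{q-2}} + \frac{\sigma}{\lambda} z_i ,$$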



$^2$ We can always take $B = 1/\sigma$ since by the optimality of $w^*$ we have $\|w^*\|_1 \le 1/\sigma$.






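where $z_i = \mathrm{sign}(w_i)$ whenever $w_i \ne 0$ and otherwise $z_i \in [-1, 1]$. Therefore, if $w$ is an optimal solution
then for all $i$, either $w_i = 0$ or

$$\frac{3 \log(d)\, \mathrm{sign}(w_i) |w_i|^{q-1}}{\|w\|_q^{q-2}} = v_i - \frac{\sigma}{\lambda} \mathrm{sign}(w_i) .$$

Furthermore, it is easy to verify that if $w$ is an optimal solution then for all $i$, if $w_i \ne 0$ then the sign of $w_i$
must be the sign of $v_i$. Therefore, whenever $w_i \ne 0$ we have that

$$|w_i|^{q-1} = \frac{\|w\|_q^{q-2}}{3 \log(d)} \left( |v_i| - \frac{\sigma}{\lambda} \right) .$$

It follows that in that case we must have $|v_i| > \frac{\sigma}{\lambda}$. And, the other direction is also true, namely, if $|v_i| > \frac{\sigma}{\lambda}$
then $w_i$ must be non-zero. This is true because if $|v_i| > \frac{\sigma}{\lambda}$, then the $i$'th coordinate of any sub-gradient of
the objective function at any vector $w$ s.t. $w_i = 0$ is $-v_i + \frac{\sigma}{\lambda} z_i \ne 0$. Hence, $w$ can't be an optimal solution.
This leads to the conclusion that an optimal solution has the form

$$\nabla_i g^*(v) = \begin{cases} a\, \mathrm{sign}(v_i) \left( |v_i| - \frac{\sigma}{\lambda} \right)^{\frac{1}{q-1}} & \text{if } |v_i| > \frac{\sigma}{\lambda} \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

where

$$a = \left( \frac{\|\nabla g^*(v)\|_q^{q-2}}{3 \log(d)} \right)^{\frac{1}{q-1}} .$$

Taking the $q$-norm of (10) gives $\|\nabla g^*(v)\|_q^q = a^q \sum_{i : |v_i| > \sigma/\lambda} \left( |v_i| - \frac{\sigma}{\lambda} \right)^{\frac{q}{q-1}}$, which yields

$$a = \frac{1}{3 \log(d)} \left( \sum_{i : |v_i| > \frac{\sigma}{\lambda}} \left( |v_i| - \frac{\sigma}{\lambda} \right)^{\frac{q}{q-1}} \right)^{\frac{q-2}{q}} . \tag{11}$$

The resulting algorithm is as follows: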



Procedure Prox-SDCA for minimizing (6) using $g$ as in (9)

Parameters:
    regularization $\sigma$
    target accuracy $\epsilon$
    dimension $d$
    $B \ge \|w^*\|_1$ (default value $B = 1/\sigma$)
Run Prox-SDCA with:
    $\|\cdot\|_D = |\cdot|$, $\|\cdot\|_{D'} = \|\cdot\|_\infty$, and $R \ge \max_i \|x_i\|_\infty$
    $\lambda = \frac{\epsilon}{3 \log(d) B^2}$
    $\nabla g^*(v)$ according to (10) and (11)
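
The map $v \mapsto \nabla g^*(v)$ for the $\ell_q$ regularizer (9) can be sketched in a few lines of NumPy. The closed form coded below is the one obtained by combining the stationarity condition $|w_i|^{q-1} = \frac{\|w\|_q^{q-2}}{3\log(d)}(|v_i| - \sigma/\lambda)$ with the definition of the $q$-norm; the function name and the argument `thresh` (standing for $\sigma/\lambda$) are hypothetical, and the sketch assumes $d \ge 3$ so that $\log(d) > 1$ and $q$ is well defined.

```python
import numpy as np

def grad_g_conj_lq(v, thresh):
    """Gradient of g^* for g(w) = 1.5*log(d)*||w||_q^2 + thresh*||w||_1, q = log(d)/(log(d)-1)."""
    d = v.size
    if d < 3:
        raise ValueError("need d >= 3 so that log(d) > 1")
    logd = np.log(d)
    q = logd / (logd - 1.0)
    excess = np.maximum(np.abs(v) - thresh, 0.0)   # [|v_i| - sigma/lam]_+
    S = np.sum(excess ** (q / (q - 1.0)))
    if S == 0.0:
        return np.zeros_like(v)                    # every coordinate is thresholded away
    a = S ** ((q - 2.0) / q) / (3.0 * logd)        # common scaling factor
    return np.sign(v) * a * excess ** (1.0 / (q - 1.0))

v = np.array([2.0, -1.5, 0.2, 0.0, 3.0, -0.1, 0.7, -2.2, 1.1, 0.4])
w = grad_g_conj_lq(v, thresh=0.5)
```

As with soft-thresholding, coordinates with $|v_i| \le \sigma/\lambda$ are mapped to zero, and the whole computation costs $O(d)$.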



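In terms of runtime, we obtain the following:

Corollary 2. The number of iterations required by Prox-SDCA, with $g$ as in (9), for solving (6) to accuracy
$\epsilon$ is

$$\tilde{O}\left( n + \frac{R^2 B^2 \log(d)}{\epsilon \gamma} \right) \quad \text{if } \forall i, \ \phi_i \text{ is } (1/\gamma)\text{-smooth}$$

$$\tilde{O}\left( n + \frac{L^2 R^2 B^2 \log(d)}{\epsilon^2} \right) \quad \text{if } \forall i, \ \phi_i \text{ is } L\text{-Lipschitz}$$

In both cases, $R = \max_i \|x_i\|_\infty$ and $B$ is an upper bound on $\|w^*\|_1$.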



Related work 



The algorithm we have obtained is similar to the Mirror Descent framework of Beck and Teboulle [2003] and
its online or stochastic versions (see for example Shalev-Shwartz [2011] and the references therein). It is
also closely related to the SMIDAS and COMID algorithms, Shalev-Shwartz and Tewari [2011], as well as
to dual averaging, Xiao [2010]. Comparing the rates of these algorithms to Prox-SDCA, we obtain similar
differences as in the previous subsection, only now $B$ is a bound on $\|w^*\|_1$ rather than $\|w^*\|_2$ and $R$ is a
bound on $\max_i \|x_i\|_\infty$ rather than $\max_i \|x_i\|_2$.



3.3 Multiclass categorization and structured prediction 

In structured output problems, there is an instance space $\mathcal{X}$ and a large target space $\mathcal{Y}$. There is a function
$\psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$. We assume that the range of $\psi$ is in the $\ell_2$ ball of radius $R$ of $\mathbb{R}^d$. The prediction of a
vector $w \in \mathbb{R}^d$ is

$$\operatorname*{argmax}_{y \in \mathcal{Y}} w^\top \psi(x, y) .$$

There is also a function $\delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ which evaluates the cost of predicting a label $y'$ when the true label
is $y$. We assume that $\delta(y, y) = 0$ for all $y$. The generalized hinge-loss defined below is used as a convex
surrogate loss function

$$\max_{y'} \left[ \delta(y', y) - w^\top \psi(x, y) + w^\top \psi(x, y') \right] .$$

The optimization problem associated with learning $w$ is now

$$\min_w \left[ \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n \max_{y'} \left( \delta(y', y_i) - w^\top \psi(x_i, y_i) + w^\top \psi(x_i, y') \right) \right] . \tag{12}$$

The above optimization problem can be cast in our setting as follows. W.l.o.g. assume that $\mathcal{Y} =
\{1, \ldots, k\}$. For each $i$ and each $j$, let the $j$'th column of $X_i$ be $\psi(x_i, j)$. Define

$$\phi_i(v) = \max_j \left( \delta(j, y_i) - v_{y_i} + v_j \right) .$$

Finally, let $g(w) = \frac{1}{2} \|w\|_2^2$. Then, (12) can be written in the form of (1).

To apply Prox-SDCA to this problem, note that $g$ is 1-strongly convex w.r.t. $\|\cdot\|_2$ and that $\phi_i$ is
2-Lipschitz w.r.t. the norm $\|\cdot\|_\infty$. Indeed, given vectors $u, v$, let $j$ be the index that attains the maximum in the
definition of $\phi_i(v)$, then

$$\phi_i(v) - \phi_i(u) \le \left( \delta(j, y_i) - v_{y_i} + v_j \right) - \left( \delta(j, y_i) - u_{y_i} + u_j \right) \le 2 \|v - u\|_\infty .$$
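Therefore $\|\cdot\|_D = \|\cdot\|_1$ and $\|\cdot\|_{D'} = \|\cdot\|_2$. If we let

$$R \ge \max_{i,j} \|\psi(x_i, j)\|_2 ,$$

then we have that

$$\|X_i\| = \sup_{u \ne 0} \frac{\|X_i u\|_2}{\|u\|_1} = \sup_{u : \|u\|_1 = 1} \|X_i u\|_2 = \max_j \|\psi(x_i, j)\|_2 \le R .$$

To calculate the dual of $\phi_i$, note that we can write $\phi_i$ as

$$\phi_i(v) = \max_{\beta \in \Delta^k} \sum_j \beta_j \left( \delta(j, y_i) - v_{y_i} + v_j \right) ,$$

where $\Delta^k = \{\beta : \sum_j \beta_j \le 1; \ \beta_j \ge 0\}$ is the non-negative simplex of $\mathbb{R}^k$. Hence, the dual of $\phi_i$ is

$$\phi_i^*(\alpha) = \max_v \left[ v^\top \alpha - \phi_i(v) \right]$$

$$= \max_v \min_{\beta \in \Delta^k} \left[ v^\top \alpha - \sum_j \beta_j \left( \delta(j, y_i) - v_{y_i} + v_j \right) \right]$$

$$= \min_{\beta \in \Delta^k} \max_v \left[ v^\top \alpha - \sum_j \beta_j \left( \delta(j, y_i) - v_{y_i} + v_j \right) \right]$$

$$= \min_{\beta \in \Delta^k} \left[ -\beta^\top \delta(\cdot, y_i) + \max_v \left( v^\top (\alpha - \beta) + v_{y_i} \sum_j \beta_j \right) \right] .$$

The inner maximization over $v$ would be $\infty$ if for some $j \ne y_i$ we have $\alpha_j \ne \beta_j$. Otherwise, if for all $j \ne y_i$
we have $\alpha_j = \beta_j$, the inner objective becomes

$$v_{y_i} \left( \alpha_{y_i} - \beta_{y_i} + \sum_j \beta_j \right) = v_{y_i} \left( \alpha_{y_i} + \sum_{j \ne y_i} \alpha_j \right) .$$

Therefore, the objective would again be $\infty$ if $\alpha_{y_i} \ne -\sum_{j \ne y_i} \alpha_j$. In all other cases, the inner maximum is zero.
Overall, using $\delta(y_i, y_i) = 0$, this implies that:

$$\phi_i^*(\alpha) = \begin{cases} -\sum_j \alpha_j \delta(j, y_i) & \text{if } \sum_j \alpha_j = 0 \ \wedge \ \forall j \ne y_i, \ \alpha_j \ge 0 \ \wedge \ \sum_{j \ne y_i} \alpha_j \le 1 \\ \infty & \text{o.w.} \end{cases}$$

Finally, we specify Prox-SDCA (using Option IV with 2 as an upper bound of $\|z\|_D$, and the random output
option), and rely on the fact that a sub-gradient of $\phi_i(v)$ is a vector $e_j - e_{y_i}$ with $j \in \operatorname*{argmax}_j \left( \delta(j, y_i) - v_{y_i} + v_j \right)$.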
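Procedure Prox-SDCA for structured output learning

Parameter: scalar $\lambda$
Let $w^{(0)} = 0$; $R \ge \max_{i,j} \|\psi(x_i, j)\|_2$
$\forall i$, $w_i^{(0)} = 0$ (we'll maintain $w_i^{(t)} = (\lambda n)^{-1} X_i \alpha_i^{(t)}$ and $w^{(t)} = \sum_i w_i^{(t)}$)
$\forall i$, $D_i^{(0)} = 0$ (we'll maintain $D_i^{(t)} = \phi_i^*(-\alpha_i^{(t)})$)
Iterate: for $t = 1, 2, \ldots, T$:
    Randomly pick $i$
    Let $j \in \operatorname*{argmax}_j \left( \delta(j, y_i) - w^{(t-1)\top} \psi(x_i, y_i) + w^{(t-1)\top} \psi(x_i, j) \right)$
    Let $P_i = \phi_i(X_i^\top w^{(t-1)}) = \delta(j, y_i) - w^{(t-1)\top} \psi(x_i, y_i) + w^{(t-1)\top} \psi(x_i, j)$
    Let $s = \min\left( 1, \ \frac{P_i + D_i^{(t-1)} + \lambda n\, w^{(t-1)\top} w_i^{(t-1)}}{4R^2 / (\lambda n)} \right)$
    $w_i^{(t)} \leftarrow (1 - s) w_i^{(t-1)} + s (\lambda n)^{-1} \left( \psi(x_i, y_i) - \psi(x_i, j) \right)$
Output:
    Return $\bar w = w^{(t)}$ for some random $t \in T_0 + 1, \ldots, T$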



Note that even if $k$ is very large, the above implementation does not maintain $\alpha$ explicitly, but only
maintains $d$-dimensional vectors. Therefore, we can implement the above procedure efficiently whenever
the optimization problem of finding $j$ can be solved efficiently. This is the same requirement as
in implementing SGD for structured output prediction.
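
The procedure can be sketched in Python for plain multiclass classification. The sketch rests on several assumptions that are not part of the text above: the feature map is the block map $\psi(x, j) = e_j \otimes x$, the cost $\delta$ is the 0-1 cost, the step size is clipped to $[0, 1]$, and the bookkeeping updates for $D_i$ and $w$ are derived from the invariants $D_i = \phi_i^*(-\alpha_i)$ and $w = \sum_i w_i$ (using that $\phi_i^*$ is linear on its domain). All function names are hypothetical.

```python
import numpy as np

def psi(x, j, k):
    """Assumed block feature map: x placed in the j'th of k blocks."""
    out = np.zeros(k * x.size)
    out[j * x.size:(j + 1) * x.size] = x
    return out

def prox_sdca_multiclass(X, y, k, lam, T, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    R2 = np.max(np.sum(X ** 2, axis=1))          # ||psi(x, j)||_2 = ||x||_2 for the block map
    w = np.zeros(k * p)
    w_i = np.zeros((n, k * p))                   # w_i = (lam n)^{-1} X_i alpha_i
    D = np.zeros(n)                              # D_i = phi_i^*(-alpha_i)
    for _ in range(T):
        i = rng.integers(n)
        scores = w.reshape(k, p) @ X[i]          # w^T psi(x_i, j) for every j
        cost = (np.arange(k) != y[i]).astype(float)   # 0-1 cost delta(j, y_i)
        j = int(np.argmax(cost + scores))        # loss-augmented prediction
        P_i = cost[j] - scores[y[i]] + scores[j]
        num = P_i + D[i] + lam * n * (w @ w_i[i])
        s = min(1.0, max(0.0, num / (4.0 * R2 / (lam * n))))  # max(0, .) only guards rounding
        new_wi = (1 - s) * w_i[i] + s * (psi(X[i], y[i], k) - psi(X[i], j, k)) / (lam * n)
        w += new_wi - w_i[i]                     # keep w = sum_i w_i
        w_i[i] = new_wi
        D[i] = (1 - s) * D[i] - s * cost[j]      # phi_i^* is linear on its domain
    dual_value = -np.mean(D) - 0.5 * lam * (w @ w)
    return w, dual_value

def primal_objective(X, y, k, lam, w):
    n, p = X.shape
    scores = X @ w.reshape(k, p).T
    cost = (np.arange(k)[None, :] != y[:, None]).astype(float)
    hinge = np.max(cost + scores - scores[np.arange(n), y][:, None], axis=1)
    return 0.5 * lam * (w @ w) + np.mean(hinge)

# Hypothetical toy data: three classes with well-separated means.
rng = np.random.default_rng(0)
y = rng.integers(3, size=60)
X = np.eye(3)[y] + 0.1 * rng.normal(size=(60, 3))
w, dual_value = prox_sdca_multiclass(X, y, k=3, lam=0.1, T=3000)
primal_value = primal_objective(X, y, 3, 0.1, w)
```

The difference `primal_value - dual_value` is the duality gap, so it can be monitored as a stopping criterion. Storing every $w_i$ costs $O(nd)$ memory in this sketch; that is a simplification of the sketch, not a statement about how the procedure must be implemented.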

Corollary 3. Prox-SDCA can be implemented for structured output prediction. To obtain an expected
duality gap of at most $\epsilon_P$, it suffices to have a total number of iterations of

$$T \ge \max\left( 0, \lceil n \log(0.5 \lambda n (2R)^{-2}) \rceil \right) + n + \frac{20 (2R)^2}{\lambda \epsilon_P} ,$$

where $R$ is an upper bound on $\|\psi(x_i, j)\|_2$. The most expensive operation at iteration $t$ is solving

$$\operatorname*{argmax}_j \left( \delta(j, y_i) - w^{(t-1)\top} \psi(x_i, y_i) + w^{(t-1)\top} \psi(x_i, j) \right) . \tag{13}$$

Remark 1. Since for this problem, $\|z\|_D^2$ in Option IV can be bounded by $L^2 = 4$ instead of $4L^2 = 16$, the
proof of Theorem 2 implies that the constant 20 in Corollary 3 can be replaced by 5.

Related Work 

For structured prediction problems, SGD enjoys the rate

$$O\left( \frac{R^2}{\lambda \epsilon} \right)$$

while the most expensive operation at each iteration of SGD also involves solving (13). Therefore, our
bound matches the bound of SGD when $n = O\left( \frac{R^2}{\lambda \epsilon} \right)$. The main advantage of our result is that it bounds the
duality gap, which can be checked in practice. Moreover, the practical convergence speed can be faster than
what is indicated in Corollary 3 when the non-smooth loss function can be approximated by a smooth loss
function, as pointed out in Shalev-Shwartz and Zhang [2012].

Recently, Lacoste-Julien et al. [2012] derived a stochastic coordinate ascent method for structural SVM based
on the Frank-Wolfe algorithm. Their algorithm is very similar to our algorithm and the rate they obtain for
the convergence of the duality gap matches our rate.

Note that the generality of our framework enables us to easily handle structured output problems with
other regularizers, such as $\ell_1$ norm regularization.



4 Proofs 



Note that the proof technique follows that of Shalev-Shwartz and Zhang [2012], but with the more involved
notation of this paper. We prove the theorems for running Prox-SDCA while choosing $\Delta\alpha_i$ as in Option I.
A careful examination of the proof easily reveals that the results hold for the other options as well. More
specifically, Lemma 1 only requires choosing $\Delta\alpha_i = s(u_i^{(t-1)} - \alpha_i^{(t-1)})$ as in (14), and Option III chooses $s$
to optimize the bound on the right hand side of (16), and hence ensures that the choice can do no worse than
the result of Lemma 1 with any $s$. The simplification in Options IV and V employs the specific simplification
of the bound in Lemma 1 in the proof of the theorems.

For convenience, we list the following simple facts about primal and dual formulations, which will be
used in the proofs. For each $i$, we have

$$-\alpha_i^* \in \partial\phi_i(X_i^\top w^*) , \qquad X_i^\top w^* \in \partial\phi_i^*(-\alpha_i^*) ,$$

and

$$w^* = \nabla g^*(v^*) , \qquad v^* = \frac{1}{\lambda n} \sum_{i=1}^n X_i \alpha_i^* .$$

The key lemma is the following: 



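Lemma 1. Assume that $\phi_i^*$ is $\gamma$-strongly-convex (where $\gamma$ can be zero). Then, for any iteration $t$ and any
$s \in [0, 1]$ we have

$$\mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{n} \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \left( \frac{s}{n} \right)^2 \frac{G^{(t)}}{2\lambda} ,$$

where

$$G^{(t)} = \frac{1}{n} \sum_{i=1}^n \left( \|X_i\|^2 - \frac{\gamma (1 - s) \lambda n}{s} \right) \mathbb{E}\left[ \|u_i^{(t-1)} - \alpha_i^{(t-1)}\|_D^2 \right] ,$$

and $-u_i^{(t-1)} \in \partial\phi_i(X_i^\top w^{(t-1)})$.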

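Proof. Since only the $i$'th element of $\alpha$ is updated, the improvement in the dual objective can be written as

$$n[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] = \left( -\phi_i^*(-\alpha_i^{(t)}) - \lambda n g^*\!\left( v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i \right) \right) - \left( -\phi_i^*(-\alpha_i^{(t-1)}) - \lambda n g^*(v^{(t-1)}) \right)$$

$$\ge \underbrace{\left( -\phi_i^*(-\alpha_i^{(t)}) - \lambda n\, h\!\left( v^{(t-1)}; (\lambda n)^{-1} X_i \Delta\alpha_i \right) \right)}_{A} - \underbrace{\left( -\phi_i^*(-\alpha_i^{(t-1)}) - \lambda n g^*(v^{(t-1)}) \right)}_{B} .$$

By the definition of the update we have for all $s \in [0, 1]$ that

$$A = \max_{\Delta\alpha_i} \ -\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - \lambda n\, h\!\left( v^{(t-1)}; (\lambda n)^{-1} X_i \Delta\alpha_i \right)$$

$$\ge -\phi_i^*\!\left( -(\alpha_i^{(t-1)} + s(u_i^{(t-1)} - \alpha_i^{(t-1)})) \right) - \lambda n\, h\!\left( v^{(t-1)}; (\lambda n)^{-1} s X_i (u_i^{(t-1)} - \alpha_i^{(t-1)}) \right) . \tag{14}$$

From now on, we omit the superscripts and subscripts. Since $\phi^*$ is $\gamma$-strongly convex, we have that

$$\phi^*(-(\alpha + s(u - \alpha))) = \phi^*(s(-u) + (1 - s)(-\alpha)) \le s\phi^*(-u) + (1 - s)\phi^*(-\alpha) - \frac{\gamma}{2} s(1 - s) \|u - \alpha\|_D^2 . \tag{15}$$

Combining this with (14) and rearranging terms we obtain that

$$A \ge -s\phi^*(-u) - (1 - s)\phi^*(-\alpha) + \frac{\gamma}{2} s(1 - s) \|u - \alpha\|_D^2 - \lambda n\, h\!\left( v; (\lambda n)^{-1} s X(u - \alpha) \right)$$

$$= -s\phi^*(-u) - (1 - s)\phi^*(-\alpha) + \frac{\gamma}{2} s(1 - s) \|u - \alpha\|_D^2 - \lambda n g^*(v) - s w^\top X(u - \alpha) - \frac{s^2}{2\lambda n} \|X(u - \alpha)\|_{D'}^2$$

$$\ge -s\phi^*(-u) - (1 - s)\phi^*(-\alpha) + \frac{\gamma}{2} s(1 - s) \|u - \alpha\|_D^2 - \lambda n g^*(v) - s w^\top X(u - \alpha) - \frac{s^2}{2\lambda n} \|X\|^2 \|u - \alpha\|_D^2$$

$$= \underbrace{-s\left( \phi^*(-u) + w^\top X u \right)}_{s\, \phi(X^\top w)} + \underbrace{\left( -\phi^*(-\alpha) - \lambda n g^*(v) \right)}_{B} + \frac{s}{2} \left( \gamma(1 - s) - \frac{s \|X\|^2}{\lambda n} \right) \|u - \alpha\|_D^2 + s\left( \phi^*(-\alpha) + w^\top X \alpha \right) ,$$

where we used $-u \in \partial\phi(X^\top w)$ which yields $\phi^*(-u) = -w^\top X u - \phi(X^\top w)$. Therefore

$$A - B \ge s \left[ \phi(X^\top w) + \phi^*(-\alpha) + w^\top X \alpha + \left( \frac{\gamma(1 - s)}{2} - \frac{s \|X\|^2}{2\lambda n} \right) \|u - \alpha\|_D^2 \right] . \tag{16}$$

Next note that with $w = \nabla g^*(v)$, we have $g(w) + g^*(v) = w^\top v$. Therefore:

$$P(w) - D(\alpha) = \frac{1}{n} \sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w) - \left( -\frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i) - \lambda g^*(v) \right)$$

$$= \frac{1}{n} \sum_{i=1}^n \phi_i(X_i^\top w) + \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i) + \lambda w^\top v$$

$$= \frac{1}{n} \sum_{i=1}^n \left( \phi_i(X_i^\top w) + \phi_i^*(-\alpha_i) + w^\top X_i \alpha_i \right) .$$

Therefore, if we take expectation of (16) w.r.t. the choice of $i$ we obtain that

$$\frac{1}{s} \mathbb{E}[A - B] \ge [P(w) - D(\alpha)] - \frac{s}{2\lambda n} \cdot \underbrace{\frac{1}{n} \sum_{i=1}^n \left( \|X_i\|^2 - \frac{\gamma(1 - s)\lambda n}{s} \right) \|u_i - \alpha_i\|_D^2}_{= G^{(t)}} .$$

We have obtained that

$$\frac{n}{s} \mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \frac{s\, G^{(t)}}{2\lambda n} . \tag{17}$$

Multiplying both sides by $s/n$ concludes the proof of the lemma. □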
We also use the following simple lemma:
Lemma 2. For all $\alpha$, $D(\alpha) \le P(w^*) \le P(0) \le 1$. In addition, $D(0) \ge 0$.

Proof. The first inequality is by weak duality, the second is by the optimality of $w^*$, and the third by the
assumption that $n^{-1} \sum_i \phi_i(0) \le 1$. For the last inequality we use $-\phi_i^*(0) = -\max_z (0 - \phi_i(z)) =
\min_z \phi_i(z) \ge 0$, which yields $D(0) \ge 0$. □

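4.1 Proof of Theorem 1

Proof of Theorem 1. The assumption that $\phi_i$ is $(1/\gamma)$-smooth implies that $\phi_i^*$ is $\gamma$-strongly-convex. We will
apply Lemma 1 with $s = \frac{\lambda n \gamma}{R^2 + \lambda n \gamma} \in [0, 1]$. Recall that $\|X_i\| \le R$. Therefore, the choice of $s$ implies that

$$\|X_i\|^2 - \frac{\gamma(1 - s)\lambda n}{s} \le R^2 - \frac{1 - s}{s / (\lambda n \gamma)} = R^2 - R^2 = 0 ,$$

and hence $G^{(t)} \le 0$ for all $t$. This yields,

$$\mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{n} \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] .$$

But since $\epsilon_D^{(t-1)} := D(\alpha^*) - D(\alpha^{(t-1)}) \le P(w^{(t-1)}) - D(\alpha^{(t-1)})$ and $D(\alpha^{(t)}) - D(\alpha^{(t-1)}) = \epsilon_D^{(t-1)} - \epsilon_D^{(t)}$,
we obtain that

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left( 1 - \frac{s}{n} \right) \mathbb{E}[\epsilon_D^{(t-1)}] \le \left( 1 - \frac{s}{n} \right)^t \mathbb{E}[\epsilon_D^{(0)}] \le \left( 1 - \frac{s}{n} \right)^t \le \exp(-st/n) = \exp\left( -\frac{\lambda \gamma t}{R^2 + \lambda \gamma n} \right) .$$

This would be smaller than $\epsilon_D$ if

$$t \ge \left( n + \frac{R^2}{\lambda \gamma} \right) \log(1/\epsilon_D) .$$

It implies that

$$\mathbb{E}[P(w^{(t)}) - D(\alpha^{(t)})] \le \frac{n}{s} \mathbb{E}[\epsilon_D^{(t)} - \epsilon_D^{(t+1)}] \le \frac{n}{s} \mathbb{E}[\epsilon_D^{(t)}] . \tag{18}$$

So, requiring $\epsilon_D^{(t)} \le \frac{s}{n} \epsilon_P$ we obtain a duality gap of at most $\epsilon_P$. This means that we should require

$$t \ge \left( n + \frac{R^2}{\lambda \gamma} \right) \log\left( \left( n + \frac{R^2}{\lambda \gamma} \right) \cdot \frac{1}{\epsilon_P} \right) ,$$

which proves the first part of Theorem 1.

Next, we sum (18) over $t = T_0, \ldots, T - 1$ to obtain

$$\mathbb{E}\left[ \frac{1}{T - T_0} \sum_{t=T_0}^{T-1} \left( P(w^{(t)}) - D(\alpha^{(t)}) \right) \right] \le \frac{n}{s(T - T_0)} \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] .$$

Now, if we choose $\bar w, \bar\alpha$ to be either the average vectors or a randomly chosen vector over $t \in \{T_0 +
1, \ldots, T\}$, then the above implies

$$\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \frac{n}{s(T - T_0)} \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] \le \frac{n}{s(T - T_0)} \mathbb{E}[\epsilon_D^{(T_0)}] .$$

It follows that in order to obtain a result of $\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \epsilon_P$, we only need to have

$$\mathbb{E}[\epsilon_D^{(T_0)}] \le \frac{s(T - T_0)\epsilon_P}{n} = \frac{(T - T_0)\epsilon_P}{n + \frac{R^2}{\lambda \gamma}} .$$

This implies the second part of Theorem 1 and concludes the proof. □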
4.2 Proof of Theorem 2

Next, we turn to the case of Lipschitz loss functions. We rely on the following lemma.

Lemma 3. Let $\phi : \mathbb{R}^k \to \mathbb{R}$ be an $L$-Lipschitz function w.r.t. a norm $\|\cdot\|_P$ and let $\|\cdot\|_D$ be the dual norm.
Then, for any $\alpha \in \mathbb{R}^k$ s.t. $\|\alpha\|_D > L$ we have that $\phi^*(\alpha) = \infty$.

Proof. Fix some $\alpha$ with $\|\alpha\|_D > L$. Let $x_0$ be a vector such that $\|x_0\|_P = 1$ and $\alpha^\top x_0 = \|\alpha\|_D$ (this is a
vector that achieves the maximal objective in the definition of the dual norm). By definition of the conjugate
we have

$$\phi^*(\alpha) = \sup_x [\alpha^\top x - \phi(x)]$$

$$\ge -\phi(0) + \sup_x [\alpha^\top x - (\phi(x) - \phi(0))]$$

$$\ge -\phi(0) + \sup_x [\alpha^\top x - L \|x - 0\|_P]$$

$$\ge -\phi(0) + \sup_{c > 0} [\alpha^\top (c x_0) - L \|c x_0\|_P]$$

$$= -\phi(0) + \sup_{c > 0} (\|\alpha\|_D - L)\, c = \infty .$$

□

A direct corollary of the above lemma is:

Lemma 4. Suppose that for all $i$, $\phi_i$ is $L$-Lipschitz w.r.t. $\|\cdot\|_P$. Let $G^{(t)}$ be as defined in Lemma 1 (with
$\gamma = 0$). Then, $G^{(t)} \le 4R^2 L^2$.

Proof. Using Lemma 3 we know that $\|\alpha_i^{(t-1)}\|_D \le L$, and in addition by the relation of Lipschitz and
sub-gradients we have $\|u_i^{(t-1)}\|_D \le L$. Combining this with the triangle inequality we obtain that
$\|u_i^{(t-1)} - \alpha_i^{(t-1)}\|_D^2 \le 4L^2$, and the proof follows. □

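We are now ready to prove Theorem 2.

Proof of Theorem 2. Let $G = \max_t G^{(t)}$ and note that by Lemma 4 we have $G \le 4R^2 L^2$. Lemma 1, with
$\gamma = 0$, tells us that

$$\mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{n} \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \left( \frac{s}{n} \right)^2 \frac{G}{2\lambda} , \tag{19}$$

which implies that

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left( 1 - \frac{s}{n} \right) \mathbb{E}[\epsilon_D^{(t-1)}] + \left( \frac{s}{n} \right)^2 \frac{G}{2\lambda} .$$

We next show that the above yields

$$\mathbb{E}[\epsilon_D^{(t)}] \le \frac{2G}{\lambda(2n + t - t_0)} \tag{20}$$

for all $t \ge t_0 = \max(0, \lceil n \log(2\lambda n \epsilon_D^{(0)} / G) \rceil)$. Indeed, let us choose $s = 1$, then at $t = t_0$, we have

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left( 1 - \frac{1}{n} \right)^t \epsilon_D^{(0)} + \frac{G}{2\lambda n^2} \cdot \frac{1}{1 - (1 - 1/n)} \le e^{-t/n} \epsilon_D^{(0)} + \frac{G}{2\lambda n} \le \frac{G}{\lambda n} .$$

This implies that (20) holds at $t = t_0$. For $t > t_0$ we use an inductive argument. Suppose the claim holds
for $t - 1$, therefore

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left( 1 - \frac{s}{n} \right) \mathbb{E}[\epsilon_D^{(t-1)}] + \left( \frac{s}{n} \right)^2 \frac{G}{2\lambda} \le \left( 1 - \frac{s}{n} \right) \frac{2G}{\lambda(2n + t - 1 - t_0)} + \left( \frac{s}{n} \right)^2 \frac{G}{2\lambda} .$$

Choosing $s = 2n / (2n - t_0 + t - 1) \in [0, 1]$ yields

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left( 1 - \frac{2}{2n - t_0 + t - 1} \right) \frac{2G}{\lambda(2n - t_0 + t - 1)} + \left( \frac{2}{2n - t_0 + t - 1} \right)^2 \frac{G}{2\lambda}$$

$$= \frac{2G}{\lambda(2n - t_0 + t - 1)} \left( 1 - \frac{1}{2n - t_0 + t - 1} \right)$$

$$= \frac{2G}{\lambda(2n - t_0 + t - 1)} \cdot \frac{2n - t_0 + t - 2}{2n - t_0 + t - 1}$$

$$\le \frac{2G}{\lambda(2n - t_0 + t - 1)} \cdot \frac{2n - t_0 + t - 1}{2n - t_0 + t}$$

$$= \frac{2G}{\lambda(2n - t_0 + t)} .$$

This provides a bound on the dual sub-optimality. We next turn to bound the duality gap. Summing (19)
over $t = T_0 + 1, \ldots, T$ and rearranging terms we obtain that

$$\mathbb{E}\left[ \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \left( P(w^{(t-1)}) - D(\alpha^{(t-1)}) \right) \right] \le \frac{n}{s(T - T_0)} \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] + \frac{sG}{2\lambda n} .$$

Now, if we choose $\bar w, \bar\alpha$ to be either the average vectors or a randomly chosen vector over $t \in \{T_0 +
1, \ldots, T\}$, then the above implies

$$\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \frac{n}{s(T - T_0)} \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] + \frac{sG}{2\lambda n} .$$

If $T \ge n + T_0$ and $T_0 \ge t_0$, we can set $s = n / (T - T_0)$ and combining with (20) we obtain

$$\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] + \frac{G}{2\lambda(T - T_0)}$$

$$\le \mathbb{E}[D(\alpha^*) - D(\alpha^{(T_0)})] + \frac{G}{2\lambda(T - T_0)}$$

$$\le \frac{2G}{\lambda(2n - t_0 + T_0)} + \frac{G}{2\lambda(T - T_0)} .$$

A sufficient condition for the above to be smaller than $\epsilon_P$ is that $T_0 \ge \frac{4G}{\lambda \epsilon_P} - 2n + t_0$ and $T \ge T_0 + \frac{G}{\lambda \epsilon_P}$. It
also implies that $\mathbb{E}[D(\alpha^*) - D(\alpha^{(T_0)})] \le \epsilon_P / 2$. Since we also need $T_0 \ge t_0$ and $T - T_0 \ge n$, the overall
number of required iterations can be

$$T_0 \ge \max\{ t_0, \ 4G/(\lambda \epsilon_P) - 2n + t_0 \} , \qquad T - T_0 \ge \max\{ n, \ G/(\lambda \epsilon_P) \} .$$

We conclude the proof by noticing that $\epsilon_D^{(0)} \le 1$ (Lemma 2), which implies that $t_0 \le \max(0, \lceil n \log(2\lambda n / G) \rceil)$.

□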



References 

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.
Operations Research Letters, 31:167-175, 2003.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for
conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:
1775-1822, 2008.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. The Journal
of Machine Learning Research, 10:2899-2934, 2009.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent.
In Proceedings of the 23rd Annual Conference on Learning Theory, pages 14-26, 2010.

S. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Journal
of Machine Learning Research, 13:1865-1890, Jun 2012.

S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Stochastic block-coordinate Frank-Wolfe optimization
for structural SVMs. arXiv preprint arXiv:1207.4747, 2012.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. In NIPS, pages 905-912,
2009.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A Stochastic Gradient Method with an Exponential
Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. arXiv preprint
arXiv:1202.6258, 2012.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal
on Optimization, 22(2):341-362, 2012.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine
Learning, 4(2):107-194, 2011.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. The Journal of
Machine Learning Research, 12:1865-1892, 2011.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization.
arXiv preprint arXiv:1209.1873, 2012.

Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. Journal of
Machine Learning Research, 11:2543-2596, 2010.






